Arabic Diacritic Recovery Using a Feature-rich biLSTM Model
Authors
Abstract
Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this article, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems, with a CW error rate (CWER) of 2.9% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA), and a CWER of 2.2% and a CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant error rates are 6.0% and 4.3% for MSA and CA, respectively. This highlights the effectiveness of feature engineering for such deep neural models.
Similar resources
Arabic Text Representation using Rich Semantic Graph: A Case Study
Representing Arabic text semantically using a Rich Semantic Graph (RSG) is one of the recent techniques that facilitate the process of manipulating the Arabic language in the Natural Language Processing (NLP) field. The work presented in this paper is part of ongoing research to create an abstractive summary for a single input document in the Arabic language. The abstractive summary is generated thr...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach
The task of answer selection in community question answering consists of identifying pertinent answers from a pool of user-generated comments related to a question. The recent SemEval-2015 introduced a shared task on community question answering, providing a corpus and evaluation scheme. In this paper we address the problem of answer selection in Arabic. Our proposed model includes a manifold o...
A simple circuit model showing feature-rich Bogdanov-Takens bifurcation
A circuit model is proposed for studying the global behavior of the normal form describing the Bogdanov-Takens bifurcation, which is encountered in the study of autonomous dynamical systems arising in different branches of science and engineering. The circuit is easy-to-implement and one can experimentally study the rich dynamics and bifurcations simply by altering the values of some linear cir...
A Feature-Rich Constituent Context Model for Grammar Induction
We present LLCCM, a log-linear variant of the constituent context model (CCM) of grammar induction. LLCCM retains the simplicity of the original CCM but extends robustly to long sentences. On sentences of up to length 40, LLCCM outperforms CCM by 13.9% bracketing F1 and outperforms a right-branching baseline in regimes where CCM does not.
Journal
Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing
Year: 2021
ISSN: 2375-4699, 2375-4702
DOI: https://doi.org/10.1145/3434235